Characterization for MapReduce on the Cloud
Authors
Abstract
MapReduce is now a pervasive analytics engine on the cloud. Hadoop, an open source implementation of MapReduce, currently enjoys wide popularity. Hadoop exposes a high-dimensional space of configuration parameters, which makes it difficult for practitioners to tune for efficient and cost-effective execution. In this work we observe that MapReduce application performance is highly influenced by map concurrency. Map concurrency is defined in terms of two configurable parameters: the number of available map slots and the number of map tasks running over those slots. We show that some inherent MapReduce characteristics enable well-informed prediction of map concurrency. We propose Map Concurrency Characterization (MC), a standalone utility program that can predict the best map concurrency for any given MapReduce application. By leveraging these predictions, MC can judiciously guide Map phase configuration and, consequently, improve Hadoop performance. Unlike many related schemes, MC does not employ simulation, dynamic instrumentation, or static analysis of unmodified job code to predict map concurrency. Instead, MC uses a simple yet effective mathematical model that exploits the MapReduce characteristics that impact map concurrency. We implemented MC and conducted comprehensive experiments on a private cloud and on Amazon EC2 using Hadoop 0.20.2. Our results show that MC correctly predicts the best map concurrencies for the tested benchmarks and provides up to 2.2X speedup in runtime.
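To make the two parameters behind map concurrency concrete, the sketch below relates the number of map tasks and the number of map slots to the number of "waves" the Map phase needs. This is not MC's mathematical model, which the abstract does not spell out. It assumes one map task per input split and uniform slots, and the function names `map_task_count` and `map_waves` are hypothetical.

```python
def map_task_count(input_bytes: int, split_bytes: int) -> int:
    """Number of map tasks, assuming one task per input split (an assumption)."""
    if input_bytes <= 0 or split_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-input_bytes // split_bytes)


def map_waves(num_tasks: int, num_slots: int) -> int:
    """Number of rounds needed to run num_tasks over num_slots map slots."""
    if num_tasks <= 0 or num_slots <= 0:
        raise ValueError("counts must be positive")
    return -(-num_tasks // num_slots)


def last_wave_tasks(num_tasks: int, num_slots: int) -> int:
    """Tasks running in the final wave; fewer than num_slots means idle slots."""
    remainder = num_tasks % num_slots
    return remainder if remainder else num_slots


if __name__ == "__main__":
    # Illustrative numbers only: 10 GiB of input, 64 MiB splits, 48 map slots.
    tasks = map_task_count(10 * 1024**3, 64 * 1024**2)
    print(tasks, map_waves(tasks, 48), last_wave_tasks(tasks, 48))
```

With these illustrative numbers, the final wave occupies only part of the available slots, which is one intuition for why the ratio of map tasks to map slots can affect runtime.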
Similar Resources
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations, because the rapid development in the use of information technology in general, and network technology in particular, has led many organizations to make their applications available for use via electronic platforms hos...
Communication-Aware Traffic Stream Optimization for Virtual Machine Placement in Cloud Datacenters with VL2 Topology
With the pervasiveness of cloud computing, a colossal number of applications from large organizations increasingly rely on cloud services. These demands have caused a great number of applications, in the form of a couple of virtual machine (VM) requests, to be executed on data centers' servers. Some applications are too big to be processed on a single VM. Also, there exists severa...
A Model-driven Approach for Price/Performance Tradeoffs in Cloud-based MapReduce Application Deployment
This paper describes preliminary work in developing a model-driven approach to conducting price/performance tradeoffs for Cloud-based MapReduce application deployment. The need for this work stems from the significant variability in both the MapReduce application characteristics and price/performance characteristics of the underlying cloud platform. Our approach involves a model-based machine learn...
A MR Simulator in Facilitating Cloud Computing
MapReduce is an enabling technology in support of Cloud Computing. Hadoop, which is a MapReduce implementation, has been widely used in developing MapReduce applications. This paper presents HaSim, a MapReduce simulator that builds on top of Hadoop. HaSim models a large number of parameters that can affect the behaviors of MapReduce nodes, and thus it can be used to tune the performa...
PRISM — Privacy-Preserving Search in MapReduce
We present PRISM, a privacy-preserving scheme for word search in cloud computing. Assuming a curious cloud provider, privacy of data stored in the cloud becomes an issue. The main challenge in the context of cloud computing is to design a scheme that achieves privacy while preserving the efficiency of cloud computing. Main approaches like simple encryption, Private Information Retrieval (PIR) a...